Improve smart read #51
Conversation
Codecov Report

@@           Coverage Diff            @@
##             main      #51      +/-  ##
==========================================
- Coverage   92.86%   92.81%   -0.06%
==========================================
  Files          11       11
  Lines        2635     2629       -6
  Branches      718      720       +2
==========================================
- Hits         2447     2440       -7
- Misses         84       85       +1
  Partials      104      104

☔ View full report in Codecov by Sentry.
Thank you so much! It is certainly cleaner and I will merge it. I mostly understand what you are doing, but I am still wondering where the speed-up came from. It seems like you are iterating from the first chunk until every single index is found. Wouldn't that make it slower when the indices appear pretty "late" in a large dataset? Yet, it is faster. Why is that? Yes, I will change the docstring.
I'm not sure I understand. How do we know that we can skip a chunk if we don't look at it?
I was using the indexes to bin the particles first. The key step was […]. To clarify, even though we are not reading data, […]. Now that I think about it, the main difference in performance might be because of the overhead from […].
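Not the PR code itself, but a toy sketch of the idea being discussed: which requested indexes fall inside a given chunk can be decided from the chunk layout alone, without reading any values. It mirrors the shifted-index check that appears in the snippet later in the thread (all numbers here are made up):

```python
import numpy as np

# One dimension of a chunked array with chunk sizes (5, 5, 5), covering indexes 0..14.
chunks = (5, 5, 5)
indexes = np.array([2, 7, 8, 14])    # requested positions along this dimension

block_id = 1                          # look at the second chunk, which covers indexes 5..9
offset = sum(chunks[:block_id])       # global index of the first element in this chunk
shifted = indexes - offset            # position of each request relative to this chunk
in_block = (shifted >= 0) & (shifted < chunks[block_id])

print(in_block)             # [False  True  True False] -> only 7 and 8 live in this chunk
print(shifted[in_block])    # [2 3] -> their local positions; no data was read to find this
```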
Good point. Let me check if running a separate loop with unique values is faster.
I tried the code below, but it was a little slower than the old function. It needs some thought/benchmarking, so I'll leave it to you to decide what's better!

```python
import numpy as np


def smart_read(da, indexes_tuple, dask_more_efficient=100):
    if len(indexes_tuple) != da.ndim:
        raise ValueError(
            "indexes_tuple does not match the number of dimensions: "
            f"{len(indexes_tuple)} vs {da.ndim}"
        )
    shape = indexes_tuple[0].shape
    indexes_tuple = tuple(indexes.ravel() for indexes in indexes_tuple)
    if not da.chunks:
        # Data is already in memory: plain numpy fancy indexing is enough.
        return da.values[indexes_tuple].reshape(shape)
    data = da.data
    # Deduplicate the requested points so each one is located only once.
    unique_indexes, inverse = np.unique(
        np.stack(indexes_tuple, axis=0), axis=1, return_inverse=True
    )
    found_count = 0
    block_dict = {}
    for block_ids in np.ndindex(*data.numblocks):
        shifted_indexes = []
        mask = True
        for block_id, indexes, chunks in zip(block_ids, unique_indexes, data.chunks):
            shifted = indexes - sum(chunks[:block_id])
            block_mask = (shifted >= 0) & (shifted < chunks[block_id])
            if not block_mask.any() or not (mask := mask & block_mask).any():
                break  # empty block
            shifted_indexes.append(shifted)
        else:
            block_dict[block_ids] = (mask, shifted_indexes)
            if len(block_dict) >= dask_more_efficient:
                # Too many blocks involved: fall back to dask's vectorised indexing.
                return data.vindex[indexes_tuple].compute().reshape(shape)
            if (found_count := found_count + mask.sum()) == unique_indexes.shape[1]:
                break  # all blocks found
    values = np.empty(unique_indexes.shape[1])
    for block_ids, (mask, shifted_indexes) in block_dict.items():
        block_values = data.blocks[block_ids].compute()
        values[mask] = block_values[
            tuple(indexes[mask] for indexes in shifted_indexes)
        ]
    return values[inverse].reshape(shape)
```
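For reference, a minimal usage sketch of the function above; the array shape, chunk sizes, and dimension names are assumptions for illustration, not taken from the PR:

```python
import dask.array as dsa
import numpy as np
import xarray as xr

# Hypothetical chunked 3-D DataArray; shape and chunks are illustrative only.
da = xr.DataArray(
    dsa.random.random((10, 1251, 701), chunks=(5, 400, 400)),
    dims=("k", "j", "i"),
)

# One integer index array per dimension; all index arrays share the same shape.
indexes_tuple = tuple(np.random.randint(0, n, size=1000) for n in da.shape)

values = smart_read(da, indexes_tuple)
print(values.shape)   # (1000,) -- same shape as each index array
```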
OK, thanks. This is not quite how I used unique, but I guess this part of the code is too opaque. I will take over from here, push to your branch and see what I can do with it.
Sounds good. I pushed just a small optimization: I was storing too much in the dict. BTW, if you realise that your method was better, feel free to close!
I have added a mode that tries to cut out a sub-array first, using the maximum and minimum indexes, as long as the cutout is not too large. In this way we don't have to cut out multiple chunks one by one. It is usually faster to cut one large piece than several small pieces.
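Roughly, the idea as described above, written out as a hedged sketch; the function name, threshold, and fallback behaviour here are made up for illustration and are not the actual code that was pushed:

```python
import numpy as np

def cutout_read(da, indexes_tuple, max_cutout_size=1e8):
    """Illustrative only: slice one bounding box, then index into it in memory."""
    # Bounding box spanned by the requested indexes along each dimension.
    slices = tuple(
        slice(int(idx.min()), int(idx.max()) + 1) for idx in indexes_tuple
    )
    cutout_size = np.prod([s.stop - s.start for s in slices])
    if cutout_size > max_cutout_size:
        return None  # too large to hold in memory; fall back to a chunk-by-chunk strategy

    # One contiguous read instead of many small per-chunk reads.
    cutout = da[slices].values
    shifted = tuple(idx - s.start for idx, s in zip(indexes_tuple, slices))
    return cutout[shifted]
```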
Slicing is a good idea, but I think the code was getting a little too messy. I didn't quite get the goal of […]. You raised a good issue before: masking could be quite slow or even crash with a lot of particles. I tried dask, and it's looking good on my benchmark:

```python
for power in range(8):
    size = 10 ** power
    print(f"{size=:e}")
    indexes_tuple = tuple(
        np.random.randint(n // 3, (n // 3) * 2, size=size) for n in (10, 1251, 701)
    )
    %timeit old_smart_read(da, indexes_tuple)
    %timeit new_smart_read(da, indexes_tuple)
    print()
```
Can you try it on real data? If that works well, should we make dask a mandatory dependency?
BTW, dask only chunks for size > 10.E7, so the example above does not actually show the use of dask.
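If one wanted the benchmark above to go through the chunked code path anyway, the DataArray could be rechunked by hand first. A hedged sketch; the chunk sizes are arbitrary and not taken from the PR:

```python
# Force dask chunking on an otherwise in-memory DataArray; sizes are illustrative.
da_chunked = da.chunk((5, 400, 400))

%timeit old_smart_read(da_chunked, indexes_tuple)
%timeit new_smart_read(da_chunked, indexes_tuple)
```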
Yes, I think I had the same before, with an additional check (I used to check both […]). In the end the mask must be a 1D array, so we either do that, or we store 3 masks separately and combine them at the end.
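A tiny illustration of the two equivalent options mentioned above (the values and names are made up): keep one running mask, or keep the three per-dimension masks and combine them at the end.

```python
import numpy as np

# Per-dimension 1-D masks, one entry per requested point.
mask_k = np.array([True, True, False, True])
mask_j = np.array([True, False, False, True])
mask_i = np.array([True, True, True, True])

# AND-ing them yields the single 1-D mask the final indexing step needs.
combined = mask_k & mask_j & mask_i
print(combined)   # [ True False False  True]
```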
Did some benchmarking on the SciServer, which is not a great place to do benchmarking. I changed one line to move the indexes further behind. Looks good.
This is using dask:
This `smart_read` function is very interesting, I played a little with it today. Here is an alternative implementation. I removed `memory_chunk` because it looks a little risky: what happens if you have large chunks? You can get the same behavior by passing `da.compute()` rather than `da`. It would be nice to parametrise a little better the switch that controls the use of vectorised indexing, but I wouldn't know how to do it :)

Even if you don't merge this, I suggest improving the docstring a little to explain what's going on. "Try to do it fast and smartly." is not very helpful; try to explain the idea behind this function.

Here is a quick and dirty benchmark, it would be nice to test with real data, maybe this is just optimised for this specific case: